refactor(gpu-arch): auto-detect MAD_SYSTEM_GPU_ARCHITECTURE for local full-run mode#113

Merged
coketaste merged 3 commits into develop from coketaste/mad-system-gpu-arch
Apr 27, 2026

@coketaste coketaste commented Apr 24, 2026

Summary

  • Auto-detect GPU arch before build in full workflow (madengine run --tags): the build phase previously ran with build_only_mode=True and skipped GPU detection entirely, so
    Dockerfiles with ARG MAD_SYSTEM_GPU_ARCHITECTURE (no default) were built with an empty value. Users were forced to manually supply --additional-context '{"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx942"}}' on every local run.

GPU arch auto-detection — design

Detection fires only in the RunOrchestrator._build_phase() path (full workflow). The detect_local_gpu_arch flag defaults to False, so standalone madengine build is unaffected.

The detection chain reuses existing modules; no new detection logic is added:

  1. detect_gpu_vendor() (utils/gpu_validator.py) — fast, filesystem-only check, no subprocess
  2. get_gpu_tool_manager(vendor, rocm_path) (utils/gpu_tool_factory.py) — singleton factory
  3. manager.get_gpu_architecture() (utils/rocm_tool_manager.py / nvidia_tool_manager.py) — cached, with actionable error messages
  4. normalize_architecture_name() (execution/dockerfile_utils.py) — consistent format

A user-provided value in --additional-context is always respected and never overridden. On nodes without a GPU, detection fails gracefully with a warning.
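
The chain and the precedence rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the helper name inject_gpu_arch is hypothetical, the four real functions are passed in as callables so the block runs on its own, the "UNKNOWN" string check and the FakeManager / fake_normalize stand-ins are assumptions made for the example.

```python
import warnings
from typing import Any, Callable, Dict, Optional

ARCH_KEY = "MAD_SYSTEM_GPU_ARCHITECTURE"


def inject_gpu_arch(
    ctx: Dict[str, Any],
    detect_vendor: Callable[[], str],
    get_manager: Callable[[str, Optional[str]], Any],
    normalize: Callable[[str], str],
    rocm_path: Optional[str] = None,
) -> Dict[str, Any]:
    """Fill docker_build_arg[ARCH_KEY] unless the user already set it."""
    build_args = ctx.setdefault("docker_build_arg", {})
    if build_args.get(ARCH_KEY):
        return ctx  # user-provided value wins; detection is skipped
    try:
        vendor = detect_vendor()
        if vendor == "UNKNOWN":  # assumed sentinel for "no GPU found"
            warnings.warn("No GPU vendor detected; leaving arch unset")
            return ctx
        manager = get_manager(vendor, rocm_path)
        build_args[ARCH_KEY] = normalize(manager.get_gpu_architecture())
    except Exception as exc:  # fail gracefully: warn, never raise
        warnings.warn(f"GPU architecture auto-detection failed: {exc}")
    return ctx


class FakeManager:
    """Hypothetical stand-in for a vendor tool manager."""

    def get_gpu_architecture(self) -> str:
        return "gfx942:sramecc+:xnack-"


def fake_normalize(raw: str) -> str:
    """Hypothetical stand-in for normalize_architecture_name()."""
    return raw.split(":")[0].strip().lower()


if __name__ == "__main__":
    detected = inject_gpu_arch(
        {}, lambda: "AMD", lambda vendor, path: FakeManager(), fake_normalize
    )
    print(detected["docker_build_arg"][ARCH_KEY])
```

Passing the detection functions in as parameters is only a convenience for this sketch; it keeps the example independent of the madengine modules while preserving the order of the four steps.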

Test plan

  • madengine run --tags <tag> on a local GPU node — build phase prints Auto-detected GPU architecture for build: gfx942, no "unresolved" warning, image builds correctly
  • madengine run --tags <tag> --additional-context '{"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx90a"}}' — user value respected, auto-detect skipped
  • madengine build --tags <tag> on a non-GPU CI node — same behaviour as before (warning if Dockerfile needs arch, no crash)
  • madengine run --manifest-file build_manifest.json — run-only path unchanged
  • Existing unit tests pass: pytest tests/unit/ tests/e2e/

… full-run mode

In full workflow (madengine run --tags), the build phase ran with
build_only_mode=True and skipped GPU detection, leaving Dockerfiles that
declare ARG MAD_SYSTEM_GPU_ARCHITECTURE without a default built with an
empty value. Users had to manually pass --additional-context every time.

- Context.__init__: add detect_local_gpu_arch param (default False),
  thread it to init_build_context()
- Context.init_build_context: add detect_gpu_arch param; when True,
  reuse detect_gpu_vendor() + get_gpu_tool_manager() + normalize_architecture_name()
  to detect and inject MAD_SYSTEM_GPU_ARCHITECTURE into docker_build_arg
  before the build; user-provided value is never overridden; fails gracefully
- BuildOrchestrator.__init__: accept and forward detect_local_gpu_arch
  to Context; add resolved-arch confirmation print in execute()
- RunOrchestrator._build_phase: pass detect_local_gpu_arch=True so only
  the full-workflow build path auto-detects; standalone madengine build
  is unaffected (flag defaults to False)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
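
The flag plumbing listed in this commit message can be pictured with the simplified sketch below. The class and parameter names come from the commit message, but the constructors are heavily reduced (the real ones take more arguments), and the detect_calls counter is a hypothetical placeholder for the actual detection step.

```python
class Context:
    """Reduced stand-in: only the two flags relevant to this change."""

    def __init__(self, build_only_mode=False, detect_local_gpu_arch=False):
        self.ctx = {"docker_build_arg": {}}
        self.detect_calls = 0  # hypothetical counter, for illustration only
        if build_only_mode:
            self.init_build_context(detect_gpu_arch=detect_local_gpu_arch)

    def init_build_context(self, detect_gpu_arch=False):
        if detect_gpu_arch:
            self.detect_calls += 1  # placeholder for the detection chain


class BuildOrchestrator:
    def __init__(self, detect_local_gpu_arch=False):
        # Forward the flag unchanged; standalone builds keep the False default.
        self.context = Context(
            build_only_mode=True,
            detect_local_gpu_arch=detect_local_gpu_arch,
        )


class RunOrchestrator:
    def _build_phase(self):
        # Only the full-workflow build path opts in to auto-detection.
        return BuildOrchestrator(detect_local_gpu_arch=True)


if __name__ == "__main__":
    print(RunOrchestrator()._build_phase().context.detect_calls)  # full workflow
    print(BuildOrchestrator().context.detect_calls)               # standalone build
```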
@coketaste coketaste self-assigned this Apr 24, 2026
Copilot AI review requested due to automatic review settings April 24, 2026 04:00

Copilot AI left a comment

Pull request overview

This PR enables auto-detection of MAD_SYSTEM_GPU_ARCHITECTURE during the build phase of the full madengine run --tags ... workflow so Dockerfiles that require this build-arg (without a default) can build successfully without forcing users to pass it via --additional-context.

Changes:

  • Enable GPU-arch auto-detection in RunOrchestrator._build_phase() by passing a new detect_local_gpu_arch=True flag into BuildOrchestrator.
  • Add detect_local_gpu_arch plumbing through BuildOrchestrator into Context and implement optional build-only GPU arch injection in Context.init_build_context().
  • Print a “resolved” message when MAD_SYSTEM_GPU_ARCHITECTURE is present in docker_build_arg before building.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
src/madengine/orchestration/run_orchestrator.py Turns on build-time GPU arch auto-detection for full build+run workflows by passing detect_local_gpu_arch=True.
src/madengine/orchestration/build_orchestrator.py Adds detect_local_gpu_arch parameter, forwards it into Context, and prints when the arch is resolved.
src/madengine/core/context.py Implements optional build-only detection of GPU arch and injects it into ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"].

Comment thread src/madengine/orchestration/build_orchestrator.py

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


Comment thread src/madengine/orchestration/run_orchestrator.py
Comment thread src/madengine/core/context.py
coketaste and others added 2 commits April 27, 2026 12:55
…ild_context

Five new tests in TestBuildContextGpuArchAutoDetect cover the detect_gpu_arch
path added in 78b5110:
- arch is injected when MAD_SYSTEM_GPU_ARCHITECTURE is absent
- user-provided value is preserved (no override)
- UNKNOWN vendor emits a warning and leaves the key unset
- detection exceptions are caught and warned, not raised
- detect_gpu_arch=False skips vendor detection entirely

Also introduces _make_build_only_ctx() helper to safely suppress
__init__'s init_build_context call via a scoped patch.object context
manager, avoiding the previous broken pattern of calling .stop() on
a MagicMock (which is a no-op and left the real method never invoked).

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
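
The difference between the scoped patch.object helper and the earlier .stop()-on-a-MagicMock pattern can be illustrated with the sketch below. The Context class here is a hypothetical stand-in, and make_build_only_ctx only approximates what _make_build_only_ctx() might look like; the actual helper in the test suite may differ.

```python
from unittest.mock import MagicMock, patch


class Context:
    """Hypothetical stand-in: __init__ eagerly calls init_build_context."""

    def __init__(self):
        self.ctx = {}
        self.init_build_context()

    def init_build_context(self, detect_gpu_arch=False):
        self.ctx["detect_gpu_arch"] = detect_gpu_arch


def make_build_only_ctx():
    """Create a Context without running init_build_context in __init__."""
    # The patch is active only inside the with-block; on exit the real
    # method is restored on the class automatically.
    with patch.object(Context, "init_build_context"):
        ctx = Context()
    return ctx


ctx = make_build_only_ctx()
assert ctx.ctx == {}  # __init__ called the mock, not the real method

# Outside the with-block the real method is back and can be tested directly.
ctx.init_build_context(detect_gpu_arch=True)
assert ctx.ctx["detect_gpu_arch"] is True

# By contrast, .stop() on a MagicMock is just another auto-created mock
# attribute: calling it records a call and un-patches nothing.
mock = MagicMock()
result = mock.stop()
assert isinstance(result, MagicMock)
```

Only the object returned by patch(...) or patch.object(...) has a meaningful stop(); the context-manager form avoids having to track that object at all.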
Copilot AI review requested due to automatic review settings April 27, 2026 20:50
@coketaste coketaste merged commit 8467319 into develop Apr 27, 2026
1 check passed

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.



import pytest
-from unittest.mock import Mock, patch
+from unittest.mock import Mock, MagicMock, patch
Copilot AI Apr 27, 2026

Mock is imported but never used in this test module. This will trigger flake8 F401 (the repo has a flake8 pre-commit hook in .pre-commit-config.yaml). Remove the unused import or use it.

Suggested change
-from unittest.mock import Mock, MagicMock, patch
+from unittest.mock import MagicMock, patch

Comment on lines +298 to +302
resolved_arch = self.context.ctx.get("docker_build_arg", {}).get("MAD_SYSTEM_GPU_ARCHITECTURE")
if resolved_arch:
    self.rich_console.print(
        f"[green]✓ MAD_SYSTEM_GPU_ARCHITECTURE resolved: {resolved_arch}[/green]\n"
    )
Copilot AI Apr 27, 2026

The build phase banner says “(Build-only mode - no GPU detection)”, but when detect_local_gpu_arch is enabled the build-only Context now performs GPU architecture detection (and these new lines print a resolved MAD arch). Consider updating the banner/message to reflect conditional detection so the output isn’t misleading.

Comment on lines 93 to +97
build_only_mode: Whether running in build-only mode (no GPU detection).
rocm_path: Optional ROCm installation path (overrides ROCM_PATH env; default /opt/rocm).
detect_local_gpu_arch: When True and in build_only_mode, attempt to auto-detect
    MAD_SYSTEM_GPU_ARCHITECTURE from the local node and inject it into docker_build_arg.
    Has no effect when build_only_mode=False (runtime mode detects it via init_gpu_context).
Copilot AI Apr 27, 2026

The build_only_mode arg is described as “no GPU detection”, but build-only mode can now still do GPU architecture detection when detect_local_gpu_arch=True. Consider updating this docstring wording so it matches the new behavior (e.g., “skips runtime GPU context; may optionally detect build arch”).
